Using Vignettes to Measure the Quality of Health Care

Jishnu Das (World Bank)

Kenneth L. Leonard (Department of Agricultural and Resource Economics, University of Maryland)

 

I. Introduction

No matter how one looks at it—as differences across nations or as differences within nations—poor people systematically suffer from worse health outcomes than rich people. What role does medical care play?

 

This note outlines a research project that seeks to measure the quality of care, to understand how quality of care varies by geographical location and sector (private, public or non-governmental organization), and to examine how (and whether) quality of care has an impact on health choices and outcomes. We discuss an instrument, vignettes, and a measure of the quality of care, competence, which focuses on what doctors know: the maximum quality of medical advice that doctors could provide if they did everything they knew to do. Vignettes are simulated patients presented to a doctor, paired with an instrument that evaluates the quality of care provided by that doctor. Performance on the vignette is an indicator of the doctor’s competence, skill or ability. We show how competence can be validated and what can be learned by looking at correlations between competence and various attributes of the health-care provider. We propose ways in which this measure can be widely collected, while arguing for (some) uniformity in cross-country studies to enable wider comparisons.

 

The note is structured as follows. Section II presents a prima facie case for (a) incorporating the quality of care in studies of the demand for health care and outcomes and (b) measuring the quality of care through the quality of medical advice that doctors give to patients, rather than (for instance) the infrastructure in a facility. Section III introduces vignettes as a measurement tool, describing how these data are collected; Section IV discusses how vignettes are validated; Section V presents results from recent studies; Section VI concludes with some lessons learned, caveats and thoughts for further research.

 

II. Why and how should we measure quality of care?

Numerous studies have documented the role of households in producing good health outcomes—children are healthier when mothers are more educated; rich households are better able to “insure” against health shocks; rich households live in areas with better sanitation and enjoy better nutrition. Based on these studies, the explanations for health outcomes among poor people have centered almost exclusively on household choices: either poor people do not use the health system as much as they should, or, if they do go to doctors, it is usually too late. However, recent work shows that even when the poor do visit health facilities frequently, and often more frequently than the rich, their health outcomes remain dismal; the quality of the medical system must therefore also play a large role in health outcomes.

 

Earlier studies sought to measure the quality of care through the presence or absence of a primary health care center, and found little or no relationship between the existence of a health care center and health outcomes. The lack of a relationship left many questions about providers unanswered: Was it because the doctor was never there? Was the doctor qualified (holding a degree) and competent (knowledgeable)? The data to answer these crucial questions simply did not exist.

 

The next set of studies tried to address these questions by using “structural” measures of quality; that is, quality alternatively defined by physical infrastructure, the stock of medical supplies, the total number of assigned personnel, the availability of refrigeration units, the availability of electricity or a combination of some of these (Collier and others 2002; Lavy and Germain 1994). Both studies found that health-care demand responded to structural quality: more people visited health clinics when structural quality was higher.

 

A remarkable omission from these indicators is any measure of process quality, particularly the quality of medical personnel. If structural quality were well correlated with process quality, this omission could be excused because it is easier to collect data on structural quality. However, they are not well correlated, and there is good reason to believe that process quality is more important than structural quality. First, structural measures such as drug availability are largely determined by the degree of subsidy and the cost of transportation, making structural quality a predictable feature of ownership and location, whereas process quality is more likely to vary within these parameters. To the degree that one facility is more likely to experience pharmacy stock-outs than another similar facility, it is likely to be because demand is high, causing misclassification. Second, whereas both medicine and consultation are likely to be important to a patient’s health, households can mitigate problems with drug supply through purchases in other markets, whereas they cannot do this with medical care (see, for example, Foster 1995).

 

We propose to measure the (maximum) quality of medical advice a patient is likely to receive when he or she consults a doctor, and the correlates of this quality. This measure is harder to collect than structural quality, since it typically involves detailed interviews with the doctor, clinical observation of interactions between the doctor and a number of patients, or both. Together with structural quality, this research presents a more “complete” picture of the quality of medical advice.

 

III. Process Quality Evaluation Methods

 

Why use vignettes?

There are many ways to measure process quality, and these instruments, both actual and theoretical, vary according to their realism, relevance and comparability. A realistic quality measurement instrument is one that collects data on doctors’ activities in a setting that closely resembles the setting in which most care is delivered. A relevant instrument collects data on processes that matter, in the sense that the observed activities are important to outcomes for a large segment of the potential population. A comparable instrument is one that collects data that can be compared across a broad spectrum of health care providers and settings.

In practice, some compromise across these goals is inevitable. For example, the fake patient (in which an actor poses as a patient and visits a large number of providers) is both comparable (the same patient visits all providers) and realistic (the doctor is unaware of the fact that this patient is not a regular patient). However, the fake patient is unlikely to be relevant, because an actor can only pretend to be sick with a very limited set of illnesses (a headache or sore muscle, for example) and these illnesses are rarely the subject of our research. A fake patient cannot convincingly fake tuberculosis, for example. Direct clinician observation (observing the activities of doctors with their regular patients) is realistic[1] and relevant, but is not generally comparable, because strong assumptions are necessary to compare the activities of doctors who see very different types of illnesses and patients. Vignettes – the subject of this note – are specifically designed to be both comparable and relevant, because the same case is presented to every doctor and the researcher can choose to present almost any relevant illness. However, vignettes are not as realistic as other instruments, because doctors know that the patient is an actor pretending to be sick, not a regular patient. We argue that this shortcoming can be mitigated through proper design of the instruments and proper interpretation of the results.

            Vignettes will play an important role in investigations where relevance and comparability are overriding concerns. Research focused on the distribution of quality, and the determinants of that distribution, on a regional, national or international scale must be both relevant and comparable and is likely to benefit from the inclusion of data generated by vignettes. On the other hand, an empirical investigation of health-seeking behavior in a particular population or setting will put a higher premium on realism than on relevance or comparability and would find vignettes less well suited to the problem than other instruments might be.

 

What are Vignettes?

There are many different types of vignettes used in health care research today. The underlying element that connects these different instruments is the presentation of a “case” to the doctor paired with an instrument that evaluates the activities of the doctor. In some versions of vignettes, the case is a fixed narrative read from a script and in others someone pretending to be the patient acts out the case. In some types of vignettes, the doctor is asked to list the activities that he or she would undertake and in other types of vignettes, the doctor interacts with the “patient” by asking questions to which the “patient” responds. And, in some types of vignettes, the doctor is prompted by the examiner with questions such as “would you prescribe treatment X?”

            The vignettes represented in Das and Hammer (2005) and Leonard and Masatu (2005) use an enumerator trained to act as a sick person rather than having an enumerator read from a script. The characteristics of the illness and patient are predetermined but unknown to the doctor. Except for the first complaints described by the patient, the practitioner must ask questions to discover the characteristics of the illness. Because the patient is not actually sick, the physical examination is conducted in a question-and-answer format in which the doctor explains what he is looking for and the patient tells him what he would find. The measured quality of the consultation is based on the doctor’s activities during the consultation. Because doctors know the patient is simulated and the physical examination is done through question and answer, some realism is sacrificed. However, unlike with some other types of instruments, the process by which a doctor examines and diagnoses a patient resembles the actual process a doctor would normally undertake. Thus, even though the doctor knows that the simulated patient is not real, the doctor is sitting at his or her desk and gathering information about a case in order to reach a diagnosis and prescribe a treatment. Our version of vignettes is as realistic as we can make it without sacrificing comparability or relevance.

 

How easy/hard is it to use vignettes?

            It is impossible to measure process quality accurately without visiting all facilities in a sample, and vignettes require that doctors be present at the time of the visit. Once a doctor has been located, however, vignettes are relatively inexpensive to administer. The increased realism of our vignettes does have an important cost: enumerators must completely memorize a case presentation and be sufficiently well trained to adapt to the different questions posed by practitioners while still maintaining the same characteristics across all practitioners. This is much more challenging than training someone to read a list of questions off an instrument.

 

IV. Validating Vignettes

 

Do vignettes measure competence, and is competence as measured by vignettes correlated with important underlying aspects of quality? To answer these questions, the validity of vignettes can be checked for internal consistency, and the results obtained using vignettes can be compared to results obtained using more realistic instruments, in this case direct observation.

 

Internal Validity with Item Response Theory

The researcher simultaneously designs the case study patient and the diagnostic and treatment protocol. In some cases (Tanzania, for example) the protocol for certain types of illnesses already exists in the form of national diagnostic guidelines. Thus, rather than measuring competence by the number of things a doctor does, the researcher measures the percentage of required or rational items that are actually implemented by the doctor. These data can be translated into a series of correct/incorrect responses, where correct implies that a doctor used an item required by the protocol and incorrect implies that he did not. Any questions that are not on the protocol are not used in this process. This list of correct and incorrect responses lends itself to internal validation through item response theory (IRT). IRT analysis simultaneously solves for a single overall score for each doctor, a standard error for the score and the relative contribution of each individual item to that score (Das and Hammer, 2005). An internally valid vignette has a number of testable features. First, for every item on the vignette, good doctors (those with a high overall score) should be at least as likely to implement the item as bad doctors (those with a low overall score). Second, the standard errors associated with overall scores should be small enough that there exist significant differences in doctors’ scores across the sample. Such tests are possible because a vignette measures multiple aspects of the same underlying characteristic. The vignettes discussed in Das and Hammer (2005) and Leonard and Masatu (2005) are internally valid by these standards. Although the design of the vignettes used in Indonesia (Barber, Gertler and Harimurti, 2006a,b) is slightly different, they also lend themselves to IRT analysis and are internally valid by these standards[2].
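The first validity criterion above can be illustrated with a small sketch. Using entirely hypothetical checklist data, we approximate each doctor's overall score by the fraction of protocol items completed and then check, item by item, that high-scoring doctors implement the item at least as often as low-scoring doctors. (A full IRT model would jointly estimate abilities, standard errors and item parameters; this sketch shows only the monotonicity check.)

```python
# Illustrative internal-validity check: for every protocol item, "good"
# doctors (high overall score) should be at least as likely to implement
# it as "bad" doctors.  Overall score is approximated here by the
# fraction of protocol items completed.  All data are invented.

# Rows = doctors, columns = protocol items (1 = item implemented).
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

n_items = len(responses[0])
scores = [sum(row) / n_items for row in responses]
median = sorted(scores)[len(scores) // 2]

# Split the sample at the median overall score.
high = [row for row, s in zip(responses, scores) if s >= median]
low = [row for row, s in zip(responses, scores) if s < median]

def item_rates(group):
    """Fraction of doctors in `group` implementing each item."""
    return [sum(row[j] for row in group) / len(group) for j in range(n_items)]

high_rates, low_rates = item_rates(high), item_rates(low)

# An internally valid vignette should show no item that low-scoring
# doctors implement more often than high-scoring doctors.
for j, (hi, lo) in enumerate(zip(high_rates, low_rates)):
    flag = "" if hi >= lo else "  <- violates monotonicity"
    print(f"item {j}: high-score rate {hi:.2f}, low-score rate {lo:.2f}{flag}")
```

In this invented sample no item violates monotonicity, so the hypothetical vignette passes the check; real analyses estimate the same property jointly with the scores and standard errors.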

 

Comparing Vignettes and Direct Observation

We have been careful to cast vignettes as a measure of competence, not a measure of actual practice quality. Leonard and Masatu (2005) compare the performance of doctors on vignettes with their performance in practice with their regular patients. This comparison suggests that performance on the vignette is a measure of maximum performance in practice, but not a measure of actual performance. However, the vignette score and the practice quality score are significantly correlated over the sample; a doctor’s actual performance is closely tied to his potential or maximum performance. Despite their lack of realism, vignettes are a good first-order measure of practice quality. That said, the differences between competence and practice quality are potentially important, and Leonard, Masatu and Vialou (2006) suggest that the gap between ability and practice is partially determined by the organizational form of the health facility. Comparing quality within an organizational type (public, NGO or private) using vignettes is probably more valid than comparing quality across organizational types.

 

Additional sources of validation

In Tanzania, doctors who were examined with the vignette were asked to provide a diagnosis at the end of each case. Because all practitioners examined were trained in the same medical system, their diagnoses can be compared to the ‘true’ diagnosis for that case. Practitioners who scored better on the vignette were significantly more likely to reach the right diagnosis (Leonard and Masatu, 2006b). In addition, Barber, Gertler and Harimurti (2006a) find, in Indonesia, that health outcomes are worse in facilities with lower vignette scores. These results provide additional evidence that measured competence does in fact contain some information about quality.

 

V. Some Results

 

Now that we have a measure of quality based on competence, what do we do with it? As a first pass, we can try to benchmark the quality of care, providing some information about whether care is of high quality or not. Next, we can look at differences in the quality of care, perhaps across geographical or income groups.

 

Results on the baseline quality of care

 

Despite the evidence that performance on vignettes is likely to be an upper bound, the overall quality of care is low, although there is considerable variation across countries and even within countries over time. In India, doctors completed only 26 percent of the tasks that were medically required for a patient presenting with tuberculosis and only 18 percent for a child with diarrhea (Das and Hammer, 2006). Similarly, doctors in Tanzania completed less than one quarter (24 percent) of the essential checklist when faced with a patient with malaria and 38 percent for a child with diarrhea (Leonard and Masatu, 2006b). Indonesian providers perform better than those in India for both patients presenting with tuberculosis and diarrhea, but Indonesia presents a disturbing time trend. As Barber, Gertler and Harimurti (2006b) document, a virtual hiring freeze in the public sector led to a decline in the quality of care between 1993 and 1997, with a 7 percent drop in the percentage of checklist items completed in the vignettes.

 

Results on correlates of quality of care across countries

 

Though the quality of care is generally low, it is not evenly distributed. In Delhi, high-quality doctors appear to be self-sorting into locations where they serve richer clients, even within the public health system. In Indonesia, quality is much higher in the relatively wealthier Java and Bali regions than in other regions (Barber, Gertler and Harimurti, 2006a). In Tanzania, the quality of care available in rural areas is much lower than that available in urban areas (Leonard and Masatu, 2006b). In addition, in Tanzania, the NGO sector delivers marginally better health care, but also manages to provide relatively constant quality across the rural and urban divide (Leonard and Masatu, 2006b).

 

In addition, there is little evidence to suggest that purely private health care is better than public health care. Although the private care available in urban areas and the wealthy areas of towns is generally superior to that available in rural or poor neighborhoods, this is no different from the pattern in public health facilities. Private providers in rural and poor areas are not superior to public providers in those areas (Das and Hammer, 2006; Leonard and Masatu, 2006b).

 

VI. Discussion

 

Do vignettes measure aspects of medical quality that matter?

Clearly, vignettes do not measure everything that is important in health care, and a measure of practice quality that was more realistic while retaining the relevance and comparability of vignettes would be desirable. Nonetheless, vignettes make an important contribution to knowledge because they allow some understanding of the distribution of competence, which, in turn, is correlated with the distribution of practice quality. In addition, competence as measured by vignettes is not a function of location. The exact same case is presented to urban and rural doctors, to doctors accustomed to treating poor patients and to doctors accustomed to treating rich patients, and so on. The differences in competence across doctors may be highly correlated with location, but they are not caused by location.

            Other measures of process quality are more susceptible to being endogenously determined by location. For example, educated patients are more likely to argue with doctors or to insist on providing information that might be of use to the doctor. Thus, a doctor who sees mostly educated patients is more likely to follow protocol simply because his patients either encourage it or make it easier. Whether or not allowing the education level of the patient to influence quality is a good thing, it is clearly caused by the patient mix, not by the skills of the doctor. The vignette can avoid this problem because it controls for illness and patient characteristics. The researcher can choose to implement the vignette with an informative or an uninformative patient, but the same patient will be presented to all doctors.

            The distinction between quality that is poor because of the location and quality that is poor at a location is important to policy. The results that we have found with vignettes suggest that lower-quality doctors locate near poor patients. This is true even within the public sector, and suggests that poor-quality doctors are sent to work with poorer patients. From the perspective of the patient, it does not matter whether a doctor learned to be good from his experience with other patients or whether he was good when he was sent there; but for the administration of the public health care system, this difference is very important, and it can only be uncovered through the use of an instrument like vignettes.

 

The importance of quality measures with standard errors

Process quality measures of any type are likely to suffer from measurement error, and vignettes are no exception. Unlike many other measures of process quality, however, we can approximate the error in our measure of competence using IRT analysis. IRT analysis of the vignettes collected in Indonesia, India and Tanzania shows that measurement error is not constant across the sample within each country. In particular, the standard errors are much lower in the middle of the competence distribution and higher at the tails. Figure 1 shows the distribution of the information score across competence for each vignette in each of these countries. The information score is a function of the inverse of the standard error; therefore, high information scores indicate low standard errors. In India, the vignettes are not very good at distinguishing among low-competence providers, though they are very good at distinguishing among high-competence providers and between high- and low-competence providers. Though this result appears to be specific to India, that survey included informal-sector providers, who dominate the lower tail; the results shown for India would likely be obtained in any survey that includes such providers. It is easy to identify poorly trained providers, but much more difficult to measure the differences among them. In Tanzania and Indonesia, vignettes are less useful at the tails of the distribution and better in the middle. In general, it is difficult to distinguish among doctors with very low competence and among doctors with very high competence, though it is not difficult to tell the difference between low- and high-competence providers. From a policy perspective, the inability to classify correctly the best doctors is much less important than the inability to classify the worst doctors. Any policy that publishes information gathered from vignettes would have to interpret findings carefully in light of what we know about the standard errors. In addition, if policy makers believe that the differences among low-quality providers are economically important, we recommend that they use a different instrument to differentiate among these providers.

            It is tempting to look at the results obtained from vignettes and conclude that the standard errors are too high. However, it is not the case that other instruments have lower standard errors, only that many of these instruments have unmeasured or unmeasurable standard errors.

 

Lessons for the design of vignettes

The information scores shown in Figure 1 give a picture of the accuracy of the different case studies and their contributions to the overall assessment of competence. The patterns shown suggest that even for illnesses that are relevant and comparable, some cases are more useful than others. The TB vignette in India and the IUD and PNC (prenatal care) vignettes in Indonesia stand out in comparison to the other vignettes in those countries. Information is additive, so even a vignette with a very low total contribution to overall competence scores (such as the depression vignette in India) does contribute something. However, given the cost of developing vignettes and training personnel, it is clear that the value of the pre-eclampsia and TB vignettes in India and the IUD and PNC vignettes in Indonesia is greater than the value of the other vignettes in those countries. In Tanzania, by contrast, with the exception of the PID (pelvic inflammatory disease) vignette, each vignette is useful for a different segment of the distribution of competence. Information about the usefulness of vignettes can be gathered in a pre-testing stage and used to choose among vignettes, trading off the contribution to total information against the cost of data collection.
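Because information is additive across vignettes, the pre-testing tradeoff described above can be posed as a simple selection problem: given each candidate vignette's estimated contribution to total information and its data-collection cost, pick the set of cases that maximizes information within a budget. The numbers below are invented pre-test figures, not results from any of the surveys.

```python
from itertools import combinations

# Hypothetical pre-test figures:
# vignette name -> (information contribution, cost per facility).
pretest = {
    "TB": (2.0, 5.0),
    "diarrhea": (1.1, 3.0),
    "malaria": (1.3, 4.0),
    "depression": (0.3, 4.5),
}

budget = 12.0  # hypothetical data-collection budget per facility

# With only a handful of candidate cases, exhaustive enumeration of
# subsets is cheap; information is summed because it is additive.
best_set, best_info = (), 0.0
names = list(pretest)
for r in range(1, len(names) + 1):
    for subset in combinations(names, r):
        cost = sum(pretest[v][1] for v in subset)
        info = sum(pretest[v][0] for v in subset)
        if cost <= budget and info > best_info:
            best_set, best_info = subset, info

print("chosen vignettes:", best_set, "total information:", round(best_info, 2))
```

Under these invented figures the low-information depression case is dropped, mirroring the text's point that a costly vignette with a small information contribution may not be worth fielding.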

 

Issues for the implementation of vignettes on an international scale

As we have alluded to above, implementing vignettes requires extensive training and some degree of adaptation to the local setting. The questions that an actor should be prepared to answer will generally differ from one country to another, particularly for questions that are not medically relevant but for which answers must be standardized. Thus, it is unlikely that an entire vignette manual could be designed that would work anywhere in the world. In addition, illnesses such as malaria are very different across regions and would be difficult to standardize. However, there are potentially important benefits to using the same cases across countries. TB is the same in India or Indonesia, and doctors are graded only on whether or not they ask medically relevant questions. The fact that a doctor in Indonesia may ask questions about religious practices and a doctor in India may ask questions about the food eaten for dinner the previous night should not prevent us from comparing competence across countries, albeit with less confidence than we can compare doctors within countries. Since there is greater variance in training, placement and incentives across countries than within them, such studies can only improve our understanding of how competent care is delivered.

 

References

Banerjee, A., A. Deaton, and E. Duflo, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 2004, pp. 944–950.

Banerjee, A., A. Deaton, and E. Duflo, “Wealth, health and health services in rural Rajasthan,” American Economic Review, 2004, 94 (2), 326–330.

Barber, Sarah L., Paul J. Gertler, and Pandu Harimurti, “Promoting high quality care in Indonesia: Roles for public and private ambulatory care providers,” mimeo, U.C.  Berkeley 2006a.

Barber, Sarah L., Paul J. Gertler, and Pandu Harimurti, “The effect of the zero growth policy in civil service recruitment on the quality of care in Indonesia,” mimeo, U.C. Berkeley 2006b.

Collier, P., S. Dercon, and J. MacKinnon, “Density versus quality in health care provision: using household data to make budgetary choices in Ethiopia,” World Bank Economic Review, 2002.

Das, Jishnu and Jeffrey Hammer, “Which Doctor?: Combining Vignettes and Item Response to Measure Doctor Quality,” Journal of Development Economics, 2005, 78, 348–383.

Das, Jishnu and Jeffrey Hammer, “Location, location, location: Residence, Wealth and the Quality of Medical Care in Delhi, India,” Processed manuscript, The World Bank, 2006.

Filmer, D., J. Hammer, and L. Pritchett, “Weak links in the chain: a diagnosis of health policy in poor countries,” World Bank Research Observer, 2000, 25 (2), 199–224.

Leonard, Kenneth L., and Melkiory C. Masatu, “Comparing vignettes and direct clinician observation in a developing country context,” Social Science and Medicine, 2005, 61 (9), 1944–1951.

Leonard, Kenneth L., and Melkiory C. Masatu, “Outpatient process quality evaluation and the Hawthorne effect,” Social Science and Medicine, 2006a, 63 (9), 2330–2340.

Leonard, Kenneth L., and Melkiory C. Masatu, “Variation in the quality of care accessible to rural communities in Tanzania,” processed manuscript, University of Maryland, 2006b.

Leonard, Kenneth L., Melkiory C. Masatu and Alex Vialou, “Getting Doctors to do their best: the roles of ability and motivation in health care,” processed manuscript, University of Maryland, 2006.

 

 


Figure 1 Information by Vignette and Country



[1] Leonard and Masatu (2006a) show that the presence of the researcher can affect the activities of the doctor in ways that call into question the realism of the process, but these impacts can be controlled for.

[2] The vignettes used in Indonesia followed a script rather than using an actor. In addition, vignettes were administered at the facility level, not the doctor level: one doctor at a facility may have participated in one vignette and a different doctor in the other, but only facilities are identified.